
use power iteration for opnorm2 #49487

Closed

Conversation

oscardssmith
Member

This is much faster and allocates O(m) instead of O(m*n)

A  = rand(1024,1024);
# before
@time opnorm(A)
  0.148451 seconds (10 allocations: 8.210 MiB)
512.4395710760909

# after
@time opnorm(A)
  0.001369 seconds (3 allocations: 16.266 KiB)
512.439571076091

@oscardssmith oscardssmith added performance Must go faster linear algebra Linear algebra labels Apr 24, 2023
@StefanKarpinski
Member

Who would be good to review this? Maybe @andreasnoack, or he could recommend someone?

@antoine-levitt
Contributor

Clever! Of course it's better to use Lanczos for this, but maybe as a quick attempt it's acceptable. The performance with this is very unpredictable, though...

You should use randn rather than rand: randn is invariant under rotations and therefore picks points on the sphere uniformly, while rand is very biased: even if you shift it to cover (-1,1), it'll e.g. pick (1,1,...,1) sqrt(n) more often than (1,0,...,0).

I'm concerned about the squaring of A, which can mean that even when the algorithm converges, having a small residual does not mean having a small error. Maybe doing the power method (or at least the residual check) on (0 A';A 0) is better? I'll think about it.
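
For what it's worth, here is a minimal sketch of that idea (not the PR's code; the function name and tolerances are made up): run the same power iteration on A'A, normalizing the intermediate A*v, but measure convergence via the residual of the candidate singular triplet (u, σ, v) viewed as an eigenpair of the Hermitian matrix [0 A'; A 0]. With u = A*v / norm(A*v) and σ = norm(A*v), that residual reduces (up to a factor of sqrt(2)) to norm(A'*u - σ*v), which bounds the distance from σ to the spectrum rather than to its square.

using LinearAlgebra

function opnorm2_power_sketch(A::AbstractMatrix; maxiter=1000,
                              rtol=sqrt(eps(real(float(eltype(A))))))
    T = float(eltype(A))
    v = normalize!(randn(T, size(A, 2)))
    for _ in 1:maxiter
        u = A * v
        σ = norm(u)
        iszero(σ) && return first(svdvals(A))      # v hit the (numerical) null space
        u ./= σ
        w = A' * u                                 # unnormalized next right vector
        norm(w .- σ .* v) ≤ rtol * σ && return σ   # residual check on [0 A'; A 0]
        v = w ./ norm(w)
    end
    return first(svdvals(A))                       # no convergence: dense fallback
end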

@oscardssmith
Member Author

Yeah, I did about 15 minutes of reading theory before choosing this algorithm, and I knew that I wasn't picking the best method, but the others seemed more annoying to implement. As for the performance unpredictability, do I get anything from the fact that I'm only relying on the eigenvalue rather than the eigenvector?

@antoine-levitt
Contributor

do I get anything from the fact that I'm only relying on the eigenvalue rather than the eigenvector?

Not really...

Sorry, I didn't mean unpredictable in the sense that the convergence is unpredictable. The power method is actually one of the most predictable methods in numerical analysis: the convergence rate is the ratio of the second-largest to the largest singular value, squared. I just meant that for some matrices you're going to have a fast runtime, and for some others it's going to be very slow, so that might trip people up. This is especially true because you're running N iterations, so e.g. you could very well have a situation where the iteration doesn't converge for the matrix A but it does for the matrix [A 0; 0 A] (simply because it's run for twice as many iterations), which would result in the former being much slower than the latter.
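
As a back-of-the-envelope illustration of that rate (nothing here is from the PR), the number of iterations needed to reach a tolerance tol scales like log(tol) / (2 log(σ2/σ1)):

iters(ratio, tol) = ceil(Int, log(tol) / (2 * log(ratio)))
iters(0.5, 1e-12)    # → 20: a well-separated top singular value converges fast
iters(0.99, 1e-12)   # → 1375: with a 1% gap the PR's n-iteration budget may not be enough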

All in all I'm not completely convinced about this approach. opnorm is definitely not intended for "production" use (I remember the days when norm was doing opnorm, which was super annoying); it's more for when you want to check something quickly or for teaching purposes, so its speed doesn't really matter (and it's actually rather good that it's slow, because it encourages people not to use it, as it should). If you want this quickly, you usually want some estimate of the opnorm, and it's reasonable to tell people to use a superior algorithm in one of the available libraries, with their own tolerances. I'm also not 100% sure about the accuracy of this method (the 100eps() check seems a bit ad hoc), or about the possibility that it'll get stuck in a local maximum.

Did you actually have a need for this, or was it just for fun?

@oscardssmith
Member Author

For the convergence issues, is there any way to determine if we are in the condition where this will converge badly from a small number of iterations?

This came from slack (https://julialang.slack.com/archives/C6A044SQH/p1682353744395679).

the 100eps() check seems a bit ad hoc

Absolutely, there's definitely something better to do there.

@antoine-levitt
Contributor

For the convergence issues, is there any way to determine if we are in the condition where this will converge badly from a small number of iterations?

If after a few iterations the convergence rate stabilises to something (but the residual remains large), it probably means the algorithm has entered the asymptotic regime, and you can more or less predict how many iterations it's going to take by extrapolating using the convergence rate. But again that's a bit ad hoc.
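
A small sketch of that extrapolation (names and API made up): once the ratio of successive residuals settles near some ρ < 1, the number of further iterations needed to reach tol is roughly log(tol / r) / log(ρ), where r is the current residual.

function remaining_iterations(residuals::AbstractVector{<:Real}, tol::Real)
    length(residuals) < 2 && return nothing
    ρ = residuals[end] / residuals[end - 1]   # observed contraction factor
    0 < ρ < 1 || return nothing               # not (yet) in the asymptotic regime
    return ceil(Int, log(tol / residuals[end]) / log(ρ))
end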

The options here are:

  • Merge this, maybe with a few tweaks (e.g. using the [0 A; A' 0] approach)
  • Implement a Lanczos method (I don't have a lot of experience with this, but for simply estimating the largest singular value there shouldn't be that many stability issues, and a textbook implementation is probably fine)
  • Keep it as it is, maybe adding a note to the docs saying that p=2 computes an SVD and is slow, and that users who want this to be efficient should call a Lanczos algorithm on [0 A; A' 0] (a rough sketch follows below)
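
For concreteness, here is a rough textbook-style sketch of that last option (made-up name; no restarting and no reorthogonalization, so take it as an illustration only): run Lanczos on the Hermitian operator z -> [0 A'; A 0] * z applied matrix-free, and take the largest Ritz value as the estimate of opnorm(A, 2).

using LinearAlgebra

function opnorm2_lanczos_sketch(A::AbstractMatrix; k::Integer=30)
    T = float(eltype(A))
    m, n = size(A)
    # apply [0 A'; A 0] to z, with z split as [first n entries; last m entries]
    applyH(z) = vcat(A' * view(z, n+1:n+m), A * view(z, 1:n))
    q = normalize!(randn(T, m + n))
    qprev = zeros(T, m + n)
    α = real(T)[]                    # diagonal of the Lanczos tridiagonal matrix
    β = real(T)[]                    # off-diagonal
    for j in 1:k
        w = applyH(q)
        push!(α, real(dot(q, w)))
        w .-= α[end] .* q
        j > 1 && (w .-= β[end] .* qprev)
        b = norm(w)
        (j == k || iszero(b)) && break       # finished, or lucky breakdown
        push!(β, b)
        qprev, q = q, w ./ b
    end
    return maximum(eigvals(SymTridiagonal(α, β)))
end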

Honestly I think option 3 is better here, because of the reasons above. I just looked at what other languages are doing: MATLAB appears to use max(svd()) for norm, and has a separate normest that does power iteration (even though they have eigs, which implements a Krylov method! The fact that they haven't bothered upgrading it probably means it's not used very much). NumPy and R do it by SVD.

@ViralBShah
Member

Also cc @stevengj

tmp = zeros(Tnorm, m)
At = A'
v = one(Tnorm)
# this will converge quickly as long as the top two eigenvalues are distinct
Member


Couldn't we use a Lanczos iteration to be more robust and converge quickly even if a few eigenvalues are similar in magnitude?

Member

@stevengj Apr 29, 2023


This paper seems to be a popular algorithm (https://doi.org/10.1137/04060593X), and it's even fancier (restarting, bidiagonalization).

Member Author


Good point. There's a Python implementation of that paper with a good license here: https://github.com/bwlewis/irlbpy/blob/master/irlb/irlb.py. I'll see how it does.

Member Author


Good news and bad news. The good news is that I have it working. The bad news is that it takes about 84 lines of code. Performance testing to come.

@antoine-levitt
Contributor

About rand vs randn (I started replying to @stevengj's deleted comment; I think the experiment below is interesting, so I'm posting it anyway): rand underweights the basis axes, and I thought this would be noticeable, but

using LinearAlgebra, Statistics

function f(n, fun)
    x = fun(n)
    x /= norm(x)
    abs(x[1])
end
println(mean([f(10000, randn) for _ = 1:100]))
println(mean([f(10000, rand) for _ = 1:100]))

yields very similar results, so apparently it's not that bad (looks like a fun exercise in probability theory, at which I'm very bad...). So using rand is probably OK here, but since the perf impact is negligible, might as well take a rotationally invariant distribution...

@stevengj
Member

You basically want to avoid something that is nearly orthogonal to the dominant eigenvector, and the odds of that are about as negligible for rand as for randn, I think; i.e. the probabilities that abs(q' * rand()) ≤ ε and abs(q' * randn()) ≤ ε, for a given small ε and an arbitrary unit vector q, seem like they should be within a modest constant factor of each other even for the worst-case q (the former probability will depend on q because of the lack of rotational symmetry of rand). That's just a guess, though; it would be a nice little probability exercise to work this out precisely.
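
A quick Monte-Carlo sketch of that guess (names and parameters below are purely illustrative): estimate P(abs(q' * x) ≤ ε) for normalized draws x from rand and randn against two particular unit vectors q. For randn the probability cannot depend on q; for rand it can, which is the q-dependence mentioned above.

using LinearAlgebra

orth_prob(q, draw; ε=1e-2, trials=10_000) =
    count(_ -> abs(dot(q, normalize!(draw(length(q))))) ≤ ε, 1:trials) / trials

n = 500
q_axis  = normalize!([1.0; zeros(n - 1)])                   # a coordinate axis
q_mixed = normalize!([isodd(i) ? 1.0 : -1.0 for i in 1:n])  # a mixed-sign direction
for (name, q) in (("axis", q_axis), ("mixed-sign", q_mixed))
    println(name, ":  randn ", orth_prob(q, randn), "   rand ", orth_prob(q, rand))
end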

@antoine-levitt
Contributor

@stevengj yeah, I'd have guessed a sqrt(n) discrepancy seeing as the max bias for the rand distribution is of that order, but I have a very weak intuition for high dimensional probability...

So, I'm still on the fence about this PR. First, having the same function do two very different things (SVD or an iterative method) feels weird (do we already have an example of that? Usual polyalgorithms simply dispatch between different exact factorizations). Second, iterative methods are more suited to sparse matrices, so do we want this function to also work for sparse? But then do we want to fall back to SVD? Also, it would be good for this function to take a tolerance as a parameter, for early exit of the method. All in all this points to having a second method besides the current opnorm, like MATLAB's normest. And third, it does look a bit specialized for LinearAlgebra, and there are already packages doing Lanczos, so maybe it's better to just make sure there's a good function in one of the packages, and then point to it in the docs?

@stevengj
Member

Second, iterative methods are more suited to sparse matrices, so do we want this function to also work for sparse? But then do we want to fall back to SVD?

We could have an opnorm_estimate function that takes a tolerance and is suitable for sparse matrices, but have opnorm call opnorm_estimate with a machine-precision tolerance and a dense fallback if needed.
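
A rough sketch of that split (opnorm_estimate is not an existing LinearAlgebra function; the names, signatures, and convergence test below are all made up): the estimator takes a tolerance and reports whether it converged, and the exact-ish caller asks for machine precision and falls back to a dense SVD.

using LinearAlgebra

function opnorm_estimate(A::AbstractMatrix; tol::Real, maxiter::Integer=10 * size(A, 2))
    T = float(eltype(A))
    v = normalize!(randn(T, size(A, 2)))
    σ = zero(real(T))
    for _ in 1:maxiter
        w = A' * (A * v)                 # one step of power iteration on A'A
        nw = norm(w)
        iszero(nw) && return σ, false    # hit the null space; let the caller decide
        σnew = sqrt(nw)
        abs(σnew - σ) ≤ tol * σnew && return σnew, true
        σ, v = σnew, w ./ nw
    end
    return σ, false
end

function opnorm2_sketch(A::AbstractMatrix)   # hypothetical caller, not Base's opnorm
    σ, converged = opnorm_estimate(A; tol=eps(real(float(eltype(A)))))
    return converged ? σ : first(svdvals(A))
end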

@aravindh-krishnamoorthy
Contributor

aravindh-krishnamoorthy commented May 2, 2023

Glad that you found some optimization hotspots :) The tough part seems to be finding the right initial x and the number of iterations niter. It seems like for niter = n, with the current method, small to medium-sized matrices suffer in performance? Also:

julia> opnorm2(complex.(randn(2,2),randn(2,2)))
ERROR: InexactError: Float64(1.894212444018611 - 0.6331930071371694im)
[...]

I added a comment to this effect in the code. Have I got this wrong? Also, wasn't there some BLAS/LAPACK code for the power method and related iterative methods?

@aravindh-krishnamoorthy
Contributor

aravindh-krishnamoorthy commented May 2, 2023

Sorry, but unfortunately, I would also like to report that the benchmarks on my system are somehow different. The application is very interesting; I was attempting to figure out what the impact on small to medium-sized matrices might be, since (likely) convergence will fail and they'll fall back to svdvals. I use the following code in opnorm.jl:

# Uses power iteration to compute the maximal singular value.
# falls back to svdvals if it runs into numerics issues.
function opnorm2(A::AbstractMatrix{T}) where T
    LinearAlgebra.require_one_based_indexing(A)
    m,n = size(A)
    Tnorm = typeof(float(real(zero(T))))
    if m == 0 || n == 0 return zero(Tnorm) end
    if m == 1 || n == 1 return norm2(A) end
    # to minimize the chance of x being orthogonal to the largest eigenvector
    x = randn(Tnorm, n)
    tmp = zeros(Tnorm, m)
    At = A'
    v = one(Tnorm)
    # this will converge quickly as long as the top two eigenvalues are distinct
    for i in 1:n
        mul!(x, At, mul!(tmp, A, x))
        newv = norm(x)
        # the numerics got very wonky
        !isfinite(newv) && return first(svdvals(A))
        isapprox(v, newv; rtol=10*eps(Tnorm)) && return sqrt(newv)
        v = newv
        x .*= inv(newv)
    end
    # iteration failed to converge
    return first(svdvals(A))
end

Benchmarks on my system:

julia> using LinearAlgebra
julia> using BenchmarkTools
julia> include("opnorm.jl")
opnorm2 (generic function with 1 method)
julia> A = randn(1024,1024) ;
julia> methods(opnorm2)
# 1 method for generic function "opnorm2" from Main:
 [1] opnorm2(A::AbstractMatrix{T}) where T
     @ ~/opnorm.jl:3
julia> @benchmark opnorm2(A)
BenchmarkTools.Trial: 16 samples with 1 evaluation.
 Range (min … max):  145.561 ms … 395.863 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     338.414 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   330.935 ms ±  52.644 ms  ┊ GC (mean ± σ):  0.01% ± 0.03%
[...]
 Memory estimate: 16.27 KiB, allocs estimate: 3.

julia> methods(LinearAlgebra.opnorm2)
# 1 method for generic function "opnorm2" from LinearAlgebra:
 [1] opnorm2(A::AbstractMatrix{T}) where T
     @ ~/julia/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/src/generic.jl:673
julia> @benchmark LinearAlgebra.opnorm2(A)
BenchmarkTools.Trial: 34 samples with 1 evaluation.
 Range (min … max):  129.942 ms … 164.007 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     151.122 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   148.022 ms ±  10.197 ms  ┊ GC (mean ± σ):  0.05% ± 0.12%
[...]
Memory estimate: 8.59 MiB, allocs estimate: 10.

Also the numerical issues:

julia> LinearAlgebra.opnorm2(complex.(randn(2,2),randn(2,2)))
1.8431336899908002

julia> opnorm2(complex.(randn(2,2),randn(2,2)))
ERROR: InexactError: Float64(0.041218366090337966 + 0.0144244121071116im)
[...]
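
A hedged guess at the InexactError (not a confirmed diagnosis): in the code as posted, x and tmp are allocated with the real type Tnorm, so for a complex A the products A*x and A'*(A*x) are complex and cannot be stored back into real vectors. Allocating the work vectors with the matrix element type, and keeping Tnorm only for the returned norm, makes the same mul! chain go through:

using LinearAlgebra

A = complex.(randn(2, 2), randn(2, 2))
T = eltype(A)                           # ComplexF64
Tnorm = typeof(float(real(zero(T))))    # Float64: the type of the norm itself
x   = randn(float(T), size(A, 2))       # complex starting vector (was randn(Tnorm, n))
tmp = zeros(float(T), size(A, 1))       # complex scratch space (was zeros(Tnorm, m))
mul!(x, A', mul!(tmp, A, x))            # no InexactError with complex work vectors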

@oscardssmith
Member Author

I've now pushed some improvements that fix the squaring of the condition number (it turns out you don't need the sqrts, since this problem is self-dual: A' and A have the same norm, so this procedure finds the right and left singular vectors at the same time, both of which have the same singular value). The return condition is still pretty bad. For reference, here is the Lanczos version. It's a lot uglier, but it does work pretty well.

function irlb(A::Matrix{T}, nu; tol=eps(T), maxit=10000) where T
    m, n   = size(A)
    if min(m,n) < 2
        S = svd(A)
        return (S.U,S.S,S.Vt,0)
    end
    
    m_b  = min(nu+7, n)  # Working dimension size
    it   = 0
    j    = 1
    k    = nu
    smax = T(-Inf)
    local fn
    local S

    V = zeros(T, n, m_b)
    W = zeros(T, m, m_b)
    F = zeros(T, n)
    B = zeros(T, m_b, m_b)
    tmp = zeros(T, m_b)

    V[:,1] .= randn(T, n) # Initial vector
    V[:,1] .*= inv(norm(V[:,1]))

    @views while it < maxit
        it > 0 && (j = k)
        mul!(W[:,j], A, V[:,j])
        if it>0
            mul!(tmp[1:j-1], W[:,1:j-1]', W[:,j])
            mul!(W[:,j], W[:,1:j-1], tmp[1:j-1], -1, 1) # W[:,j]-=W[:,1:j-1]*tmp[1:j-1]
        end
        s = norm(W[:,j])
        W[:,j] .*= inv(s)
        # Lanczos process
        while j<=m_b
            mul!(F, A', W[:,j])
            mul!(F, s, V[:,j], -1, 1) # F -= s*V[:,j]
            mul!(tmp[1:j], V[:,1:j]', F)
            mul!(F, V[:,1:j], tmp[1:j], -1, 1)
            fn = norm(F)
            F  .*= inv(fn)
            B[j,j] = s
            if j<m_b
                V[:,j+1] .= F
                B[j,j+1] = fn 
                mul!(W[:,j+1], A, V[:,j+1])
                # One step of classical Gram-Schmidt...
                mul!(W[:,j+1], fn, W[:,j], -1, 1)
                # ...with full reorthogonalization
                mul!(tmp[1:j], W[:,1:j]', W[:,j+1])
                mul!(W[:,j+1], W[:,1:j], tmp[1:j], -1, 1) # W[:,j+1] -= W[:,1:j]*tmp[1:j]
                s = norm(W[:,j+1])
                W[:,j+1] .*= inv(s)
            end
            j+=1
        end
        # End of Lanczos process
        S    = svd!(B)
        R    = fn .* S.U[end,:] # Residuals
        smax = max(S.S[1], smax)
        conv = sum(abs.(R[1:nu+1]) .< tol*smax)
        if(conv < nu)  # Not converged yet
            k = max(conv+nu+1, k)
            k = min(k, m_b-1)
        else
            break
        end
        # Update the Ritz vectors
        V[:,1:k] .= V[:,1:m_b]*S.Vt[:,1:k]
        V[:,k] .= F 
        B .= 0
        B[diagind(B)[1:k]] .= S.S[1:k]
        B[1:k,k] .= R[1:k]
        # Update the left approximate singular vectors
        mul!(W[:,1:k], W, S.U[:,1:k])
        it += 1
    end

    U = W[:,1:m_b]*S.U[:,1:nu]
    V = V[:,1:m_b]*S.Vt[:,1:nu]
    return (U,S.S[1:nu],V,it)
end

@aravindh-krishnamoorthy
Contributor

Hello, I'm a new guy here and I don't mean to be disrespectful to anyone.

The power algorithm has its place, but it's more of a learning tool, a first step into numerical algorithms. There's a range of inputs it works well on, but I feel that it has no place in the stdlib, and definitely not for ::AbstractMatrix{T}. Furthermore, falling back to svdvals when convergence fails does not seem like a good idea, and it suggests that we are probably using the wrong approach. Although several numerically stable variants are available, each of them addresses a specific data type and is not generic enough.

Since a lot of good work has been done in this PR, I'd suggest the following:

  1. Use opnorm2 based on svdvals or LAPACK/BLAS as done today.
  2. In the documentation, write a note that opnorm2 may be slow and can be sped up for specific instances using power iteration. Then provide a curated form of this code, removing the fallbacks, etc.

This is analogous to the approach taken by BLAS/LAPACK with respect to power iteration. Furthermore, the method is simple and straightforward enough for a user to implement and experiment with for their specific needs.

@DilumAluthge
Member

We have moved the LinearAlgebra stdlib to an external repo: https://github.com/JuliaLang/LinearAlgebra.jl

@oscardssmith If you think that this PR is still relevant, please open a new PR on the LinearAlgebra.jl repo.
